1 Introduction

This markdown is for analyzing the Allen Brain Atlas MTG (middle temporal gyrus) dataset. Some details of their analysis is provided here. Briefly, these data were generated using SMART-Seq v4 Ultra Low Input RNA Kit, which is an improved version of SMART-seq2 protocol.

This pipeline using some alternative strategies for data processing and analysis, mostly based on bioconductor workflows for scRNAseq.

The packages required for the analysis are as follows: - scater: collection of tools for doing quality control analyses of scRNA-seq - scran: methods provide normalization of cell-specific biases, correcting batch effects, identify marker genes - SC3: package for single cell consensus clustering.

1.1 Loading libraries.

bio_packs = c("SingleCellExperiment","biomaRt","scater","scran","SC3")
source("https://bioconductor.org/biocLite.R")
# biocLite("BiocUpgrade",suppressUpdates = TRUE)
for(pack1 in bio_packs){
    if( !pack1 %in% installed.packages()[,"Package"]){
        biocLite(pack1, suppressUpdates = TRUE)
    }
}

cran_packs = c("data.table", "svd", "Rtsne")
for(pack1 in cran_packs){
    if( !pack1 %in% installed.packages()[,"Package"]){
        install.packages(pack1)
    }
}

library(SingleCellExperiment)
library(scater)
library(scran)
library(limma)
library(data.table)
library(svd)
library(Rtsne)
library(SC3)

2 Obtaining/Loading Counts

Before running this R markdown, please first download the datafile human_MTG_gene_expression_matrices_2018-06-14.zip from http://celltypes.brain-map.org/api/v2/well_known_file_download/694416044 to your local computer, and unzip the file. Then in the following code chunk, set the working directory to be the directory of the unzipped folder.

You can modify the following code to determine if the analysis will be conducted with exon counts only(exon_only = TRUE) or with both exon and intron counts summed together (exon_only = FALSE). We have requested 7 cores (num_cores = 7) to improve SC3 runtime.

# work_dir = "~/research/scRNAseq/data/Allen_BI/"
# work_dir = paste0(work_dir, "human_MTG_gene_expression_matrices_2018-06-14")
# repo_dir = "~/research/GitHub/scRNAseq_pipelines"

repo_dir = "/pine/scr/p/l/pllittle/CS_eQTL/s3_Real/scRNAseq_pipelines"
work_dir = file.path(repo_dir,"MTG")
source(file.path(repo_dir,"SOURCE.R"))

exon_only = TRUE
run_sc3   = !TRUE
num_cores = ifelse(run_sc3,7,1)

exons_fn    = "human_MTG_2018-06-14_exon-matrix.csv"
introns_fn  = "human_MTG_2018-06-14_intron-matrix.csv"

The following code read-in count data. To be compatible with other studies that only use count data from exonic regions, here we just read the exon counts.

counts = data.table::fread(file.path(work_dir,exons_fn),data.table = FALSE)
dim(counts)
## [1] 50281 15929
counts[1:2,1:2]
##       V1 F1S4_160106_001_B01
## 1 353007                   0
## 2 353008                   0
rownames(counts) = counts$V1
counts = as.matrix(counts[,-1])

if( !exon_only ) {
    intron_counts = fread(file.path(work_dir, introns_fn), data.table=FALSE)
    rownames(intron_counts) = intron_counts$V1
    intron_counts = as.matrix(intron_counts[,-1])
    counts = counts + intron_counts
}

Next we add the sample/cell information and gene information. Spike-ins were used for this data, though in this final count matrix, the spike-ins were not included. We generate an object of SingleCellExperiment named as sce, and then delete those original data.

cell_data = data.table::fread(file.path(work_dir,"human_MTG_2018-06-14_samples-columns.csv"))

dim(cell_data)
## [1] 15928    34
cell_data[1:2,]
##            sample_name sample_id sample_type     organism     donor sex
## 1: F1S4_160106_001_B01 556012415      Nuclei Homo sapiens H200.1030   M
## 2: F1S4_160106_001_C01 556012410      Nuclei Homo sapiens H200.1030   M
##    age_days brain_hemisphere brain_region brain_subregion facs_date
## 1:    19710                L          MTG              L5  1/6/2016
## 2:    19710                L          MTG              L5  1/6/2016
##     facs_container facs_sort_criteria rna_amplification_set
## 1: F1S4_160106_001      NeuN-positive        A8S4_160401_02
## 2: F1S4_160106_001      NeuN-positive        A8S4_160401_02
##    library_prep_set library_prep_avg_size_bp           seq_name seq_tube
## 1:   L8S4_160406_02                      429 LS-15051_S02_E1-50 LS-15051
## 2:   L8S4_160406_02                      448 LS-15051_S03_E1-50 LS-15051
##        seq_batch total_reads percent_exon_reads percent_intron_reads
## 1: R8S4-160411-H     2572946           45.08065             43.05855
## 2: R8S4-160411-H     2755839           34.91697             50.92145
##    percent_intergenic_reads percent_rrna_reads percent_mt_exon_reads
## 1:                 11.86080                  0            0.09005241
## 2:                 14.16158                  0            0.05247041
##    percent_reads_unique percent_synth_reads percent_ecoli_reads
## 1:             85.57370         0.007534554         0.018536728
## 2:             87.61978         0.002790439         0.007264575
##    percent_aligned_reads_total complexity_cg genes_detected_cpm_criterion
## 1:                    92.41888     0.3166372                         8635
## 2:                    94.00132     0.2879887                        11697
##    genes_detected_fpkm_criterion         class             cluster
## 1:                          5253     GABAergic Inh L4-6 SST B3GAT2
## 2:                          8246 Glutamatergic Exc L5-6 RORB TTC12
table(cell_data$class)
## 
##     GABAergic Glutamatergic      no class  Non-neuronal 
##          4164         10525           325           914
table(cell_data$cluster)
## 
##    Astro L1-2 FGFR3 GFAP Astro L1-6 FGFR3 SLC14A1        Endo L2-6 NOSTRIN 
##                       61                      230                        9 
##         Exc L2 LAMP5 LTK Exc L2-3 LINC00507 FREM3 Exc L2-4 LINC00507 GLP2R 
##                      812                     2284                      170 
##    Exc L3-4 RORB CARM1P1    Exc L3-5 RORB COL22A1       Exc L3-5 RORB ESR1 
##                      280                      160                     1428 
##    Exc L3-5 RORB FILIP1L     Exc L3-5 RORB TWIST2     Exc L4-5 FEZF2 SCN4B 
##                      153                       93                       25 
##      Exc L4-5 RORB DAPK2     Exc L4-5 RORB FOLH1B      Exc L4-6 FEZF2 IL26 
##                      173                      870                      344 
##        Exc L4-6 RORB C1R     Exc L4-6 RORB SEMA3E       Exc L5-6 FEZF2 ABO 
##                      160                      777                      373 
##  Exc L5-6 FEZF2 EFTUD1P1      Exc L5-6 RORB TTC12    Exc L5-6 SLC17A7 IL15 
##                      314                      167                       56 
##    Exc L5-6 THEMIS C1QL3   Exc L5-6 THEMIS CRABP1  Exc L5-6 THEMIS DCSTAMP 
##                     1537                      147                       53 
##    Exc L5-6 THEMIS FGF10       Exc L6 FEZF2 OR2T8      Exc L6 FEZF2 SCUBE1 
##                       78                       19                       52 
##        Inh L1 SST CHRNA4          Inh L1 SST NMBR       Inh L1-2 GAD1 MC4R 
##                       52                      283                      107 
##       Inh L1-2 LAMP5 DBP      Inh L1-2 PAX6 CDH12  Inh L1-2 PAX6 TNFAIP8L3 
##                       21                       90                       16 
##       Inh L1-2 SST BAGE2         Inh L1-2 VIP LBH      Inh L1-2 VIP PCDH20 
##                      108                       47                       61 
##     Inh L1-2 VIP TSPAN12       Inh L1-3 PAX6 SYT6       Inh L1-3 SST CALB1 
##                       42                       29                      279 
##    Inh L1-3 VIP ADAMTSL1     Inh L1-3 VIP CCDC184       Inh L1-3 VIP CHRM2 
##                       72                       64                      175 
##         Inh L1-3 VIP GGH      Inh L1-4 LAMP5 LCP2      Inh L1-4 VIP CHRNA6 
##                       68                      356                       25 
##       Inh L1-4 VIP OPRM1        Inh L1-4 VIP PENK       Inh L2-3 VIP CASC6 
##                       52                       17                       45 
##     Inh L2-4 PVALB WFDC2        Inh L2-4 SST FRZB       Inh L2-4 VIP CBLN1 
##                      387                       64                       67 
##      Inh L2-4 VIP SPAG17    Inh L2-5 PVALB SCUBE3    Inh L2-5 VIP SERPINF1 
##                       33                       32                       55 
##         Inh L2-5 VIP TYR       Inh L2-6 LAMP5 CA1        Inh L2-6 VIP QPCT 
##                       62                      256                       37 
##      Inh L3-5 SST ADGRG6        Inh L3-6 SST HPGD         Inh L3-6 SST NPY 
##                      132                       60                       15 
##    Inh L3-6 VIP HS3ST3A1      Inh L4-5 PVALB MEPE      Inh L4-5 SST STK32A 
##                       80                       64                       93 
##     Inh L4-6 PVALB SULF1      Inh L4-6 SST B3GAT2      Inh L4-6 SST GXYLT2 
##                      167                      182                       41 
##      Inh L5-6 GAD1 GLP1R      Inh L5-6 PVALB LGR5     Inh L5-6 SST KLHDC8A 
##                       27                       52                       63 
##    Inh L5-6 SST MIR548F2     Inh L5-6 SST NPM1P10          Inh L5-6 SST TH 
##                       80                       79                       27 
##        Micro L1-3 TYROBP                 no class        Oligo L1-6 OPALIN 
##                       63                      325                      313 
##          OPC L1-6 PDGFRA 
##                      238
cell_data$cell_type = sapply(strsplit(cell_data$cluster, split=" "), "[", 1)
table(cell_data$cell_type)
## 
## Astro  Endo   Exc   Inh Micro    no Oligo   OPC 
##   291     9 10525  4164    63   325   313   238
cell_data$cell_type[which(cell_data$cell_type == "no")] = "unknown"

all(colnames(counts) == cell_data$sample_name)
## [1] TRUE
sce = SingleCellExperiment(assays = list(counts = counts), colData = cell_data)
rm(counts, cell_data)

Import gene infomration and add them into the sce object. We also check for spike ins. Those gene names starting with ERCC are indeed gene names rather than labels for spike-ins.

gene_dat = smart_RT(file.path(work_dir,"human_MTG_2018-06-14_genes-rows.csv"), 
                    sep=',', header=TRUE)
dim(gene_dat)
## [1] 50281     5
gene_dat[1:3,]
##      gene chromosome entrez_id
## 1 3.8-1.2          6    353007
## 2 3.8-1.3          6    353008
## 3 3.8-1.4          6    353009
##                                              gene_name mouse_homologenes
## 1 HLA complex group 26 (non-protein coding) pseudogene                  
## 2 HLA complex group 26 (non-protein coding) pseudogene                  
## 3 HLA complex group 26 (non-protein coding) pseudogene
all(rownames(sce) == as.character(gene_dat$entrez_id))
## [1] TRUE
rowData(sce)  = gene_dat
rownames(sce) = rowData(sce)$gene
sce
## class: SingleCellExperiment 
## dim: 50281 15928 
## metadata(0):
## assays(1): counts
## rownames(50281): 3.8-1.2 3.8-1.3 ... bA255A11.4 bA395L14.12
## rowData names(5): gene chromosome entrez_id gene_name
##   mouse_homologenes
## colnames(15928): F1S4_160106_001_B01 F1S4_160106_001_C01 ...
##   F2S4_170405_060_F01 F2S4_170405_060_H01
## colData names(35): sample_name sample_id ... cluster cell_type
## reducedDimNames(0):
## spikeNames(0):
rm(gene_dat)

rowData(sce)[grep("^ERCC", rowData(sce)$gene),]
## DataFrame with 10 rows and 5 columns
##                    gene  chromosome entrez_id
##             <character> <character> <integer>
## ERCC1             ERCC1          19      2067
## ERCC2             ERCC2          19      2068
## ERCC3             ERCC3           2      2071
## ERCC4             ERCC4          16      2072
## ERCC5             ERCC5          13      2073
## ERCC6             ERCC6          10      2074
## ERCC6-PGBD3 ERCC6-PGBD3          10 101243544
## ERCC6L           ERCC6L           X     54821
## ERCC6L2         ERCC6L2           9    375748
## ERCC8             ERCC8           5      1161
##                                                        gene_name
##                                                      <character>
## ERCC1              excision repair cross-complementation group 1
## ERCC2              excision repair cross-complementation group 2
## ERCC3              excision repair cross-complementation group 3
## ERCC4              excision repair cross-complementation group 4
## ERCC5              excision repair cross-complementation group 5
## ERCC6              excision repair cross-complementation group 6
## ERCC6-PGBD3                              ERCC6-PGBD3 readthrough
## ERCC6L        excision repair cross-complementation group 6-like
## ERCC6L2     excision repair cross-complementation group 6-like 2
## ERCC8              excision repair cross-complementation group 8
##             mouse_homologenes
##                   <character>
## ERCC1                   Ercc1
## ERCC2                   Ercc2
## ERCC3                   Ercc3
## ERCC4                   Ercc4
## ERCC5                   Ercc5
## ERCC6                   Ercc6
## ERCC6-PGBD3                  
## ERCC6L                 Ercc6l
## ERCC6L2               Ercc6l2
## ERCC8                   Ercc8
sce
## class: SingleCellExperiment 
## dim: 50281 15928 
## metadata(0):
## assays(1): counts
## rownames(50281): 3.8-1.2 3.8-1.3 ... bA255A11.4 bA395L14.12
## rowData names(5): gene chromosome entrez_id gene_name
##   mouse_homologenes
## colnames(15928): F1S4_160106_001_B01 F1S4_160106_001_C01 ...
##   F2S4_170405_060_F01 F2S4_170405_060_H01
## colData names(35): sample_name sample_id ... cluster cell_type
## reducedDimNames(0):
## spikeNames(0):

3 Pre-processing: Quality Control, Gene Detection, Normalization

3.1 Incorporate Mito and Ribo to calculate QC metrics

Next step we apply QC based on a set of features per cell. The code and filtering below are motivated by the vignette presented here.

sort(names(colData(sce)))
##  [1] "age_days"                      "brain_hemisphere"             
##  [3] "brain_region"                  "brain_subregion"              
##  [5] "cell_type"                     "class"                        
##  [7] "cluster"                       "complexity_cg"                
##  [9] "donor"                         "facs_container"               
## [11] "facs_date"                     "facs_sort_criteria"           
## [13] "genes_detected_cpm_criterion"  "genes_detected_fpkm_criterion"
## [15] "library_prep_avg_size_bp"      "library_prep_set"             
## [17] "organism"                      "percent_aligned_reads_total"  
## [19] "percent_ecoli_reads"           "percent_exon_reads"           
## [21] "percent_intergenic_reads"      "percent_intron_reads"         
## [23] "percent_mt_exon_reads"         "percent_reads_unique"         
## [25] "percent_rrna_reads"            "percent_synth_reads"          
## [27] "rna_amplification_set"         "sample_id"                    
## [29] "sample_name"                   "sample_type"                  
## [31] "seq_batch"                     "seq_name"                     
## [33] "seq_tube"                      "sex"                          
## [35] "total_reads"
sort(names(rowData(sce)))
## [1] "chromosome"        "entrez_id"         "gene"             
## [4] "gene_name"         "mouse_homologenes"
sce = calculateQCMetrics(sce)
sort(names(colData(sce)))
##  [1] "age_days"                       "brain_hemisphere"              
##  [3] "brain_region"                   "brain_subregion"               
##  [5] "cell_type"                      "class"                         
##  [7] "cluster"                        "complexity_cg"                 
##  [9] "donor"                          "facs_container"                
## [11] "facs_date"                      "facs_sort_criteria"            
## [13] "genes_detected_cpm_criterion"   "genes_detected_fpkm_criterion" 
## [15] "is_cell_control"                "library_prep_avg_size_bp"      
## [17] "library_prep_set"               "log10_total_counts"            
## [19] "log10_total_features_by_counts" "organism"                      
## [21] "pct_counts_in_top_100_features" "pct_counts_in_top_200_features"
## [23] "pct_counts_in_top_50_features"  "pct_counts_in_top_500_features"
## [25] "percent_aligned_reads_total"    "percent_ecoli_reads"           
## [27] "percent_exon_reads"             "percent_intergenic_reads"      
## [29] "percent_intron_reads"           "percent_mt_exon_reads"         
## [31] "percent_reads_unique"           "percent_rrna_reads"            
## [33] "percent_synth_reads"            "rna_amplification_set"         
## [35] "sample_id"                      "sample_name"                   
## [37] "sample_type"                    "seq_batch"                     
## [39] "seq_name"                       "seq_tube"                      
## [41] "sex"                            "total_counts"                  
## [43] "total_features_by_counts"       "total_reads"
sort(names(rowData(sce)))
##  [1] "chromosome"            "entrez_id"            
##  [3] "gene"                  "gene_name"            
##  [5] "is_feature_control"    "log10_mean_counts"    
##  [7] "log10_total_counts"    "mean_counts"          
##  [9] "mouse_homologenes"     "n_cells_by_counts"    
## [11] "pct_dropout_by_counts" "total_counts"
par(mfrow=c(3,2),mar=c(5, 4, 1, 1), bty="n",cex=0.9)

smart_hist(log10(sce$total_counts),xlab="log10(Library sizes)",
    main="",breaks=40,ylab="Number of cells")

smart_hist(log10(sce$total_features_by_counts),xlab="log10(# of expressed genes)", 
    main="",breaks=40,ylab="Number of cells")

smart_hist(sce$percent_rrna_reads, xlab="Ribosome prop. (%)",
    ylab="Number of cells", breaks=40,main="")

smart_hist(sce$percent_mt_exon_reads, xlab="Mitochondrial prop. (%)", 
    ylab="Number of cells", breaks=80,main="")

smart_hist(sce$percent_synth_reads, xlab="Synthetic reads (%)",
    ylab="Number of cells", breaks=40,main="")

smart_hist(sce$percent_reads_unique, xlab="Unique reads (%)", 
    ylab="Number of cells", breaks=80,main="")

par(mfrow=c(3,2),mar=c(5, 4, 1, 1),bty="n",cex=0.9)

plot(log10(sce$total_counts),log10(sce$total_features_by_counts), 
    xlab="log10(Library sizes)",ylab="log10(# of expressed genes)", 
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

plot(log10(sce$total_counts),sce$percent_synth_reads, 
    xlab="log10(Library sizes)",ylab="synth reads percent (%)",
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

plot(log10(sce$total_counts),sce$percent_reads_unique, 
    xlab="log10(Library sizes)", ylab="unique reads percent (%)",
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

plot(sce$percent_exon_reads,sce$percent_reads_unique, 
    xlab="exon reads percent (%)", ylab="unique reads percent (%)",
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

plot(sce$percent_aligned_reads_total,sce$percent_reads_unique, 
    xlab="aligned reads percent (%)",ylab="unique reads percent (%)",
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

plot(sce$genes_detected_fpkm_criterion,sce$percent_reads_unique, 
    xlab="detected genes (%)",ylab="unique reads percent (%)",
    pch=20,cex=0.5,col=rgb(0,0,0,0.5))

# Removing outliers defined as being percent_reads_unique lower than 50% 
keep = sce$percent_reads_unique > 50
table(keep)
## keep
## FALSE  TRUE 
##    70 15858
table(colData(sce)$cell_type, keep)
##          keep
##           FALSE  TRUE
##   Astro       3   288
##   Endo        0     9
##   Exc        52 10473
##   Inh        13  4151
##   Micro       0    63
##   Oligo       0   313
##   OPC         0   238
##   unknown     2   323

We then subset the sce object to keep high quality samples(cells).

sce = sce[,keep]
dim(sce)
## [1] 50281 15858

3.2 Summarize gene-level information

We keep those genes that are expressed in at least 30 cells, which is roughly 0.2% of the cells. This match to the goal of this study, to detect cell types as rare as 0.2% of all the cells.

rowData(sce)[1:2,]
## DataFrame with 2 rows and 12 columns
##                gene  chromosome entrez_id
##         <character> <character> <integer>
## 3.8-1.2     3.8-1.2           6    353007
## 3.8-1.3     3.8-1.3           6    353008
##                                                    gene_name
##                                                  <character>
## 3.8-1.2 HLA complex group 26 (non-protein coding) pseudogene
## 3.8-1.3 HLA complex group 26 (non-protein coding) pseudogene
##         mouse_homologenes is_feature_control         mean_counts
##               <character>          <logical>           <numeric>
## 3.8-1.2                                FALSE 0.00878955298844802
## 3.8-1.3                                FALSE  0.0573204419889503
##           log10_mean_counts n_cells_by_counts pct_dropout_by_counts
##                   <numeric>         <integer>             <numeric>
## 3.8-1.2 0.00380057604028069                 7      99.9560522350578
## 3.8-1.3   0.024206628837133                21      99.8681567051733
##         total_counts log10_total_counts
##            <integer>          <numeric>
## 3.8-1.2          140   2.14921911265538
## 3.8-1.3          913   2.96094619573383
summary(rowData(sce)$mean_counts)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      0.02      0.31     15.13      5.56 101644.02
table(rowData(sce)$mean_counts == 0)
## 
## FALSE  TRUE 
## 48278  2003
summary(rowData(sce)$n_cells_by_counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      30     244    1744    1893   15928
table(colData(sce)$cell_type)
## 
##   Astro    Endo     Exc     Inh   Micro   Oligo     OPC unknown 
##     288       9   10473    4151      63     313     238     323
par(mfrow=c(2,2), mar=c(5,4,1,1))
smart_hist(log10(rowData(sce)$mean_counts+1e-6),main="",
    breaks=40, xlab="log10(ave # of UMI + 1e-6)")
smart_hist(log10(rowData(sce)$n_cells_by_counts+1),main="",
    breaks=40, xlab="log10(# of expressed cells + 1)")
smoothScatter(log10(rowData(sce)$mean_counts+1e-6),
    log10(rowData(sce)$n_cells_by_counts + 1),
    xlab="log10(ave # of UMI + 1e-6)",
    ylab="log10(# of expressed cells + 1)")

tb1 = table(rowData(sce)$n_cells_by_counts)
tb1[1:11]
## 
##    0    1    2    3    4    5    6    7    8    9   10 
## 2003  567  555  559  662  574  539  562  494  441  445
ncol(sce)*0.002
## [1] 31.716
min_detect_min_sample = rowSums(counts(sce) > 0) > 30
table(min_detect_min_sample)
## min_detect_min_sample
## FALSE  TRUE 
## 12624 37657
min_mean_counts05 = rowData(sce)$mean_counts > 0.5
table(min_mean_counts05)
## min_mean_counts05
## FALSE  TRUE 
## 27404 22877
table(min_detect_min_sample,min_mean_counts05)
##                      min_mean_counts05
## min_detect_min_sample FALSE  TRUE
##                 FALSE 12624     0
##                 TRUE  14780 22877
sce = sce[which(min_detect_min_sample),]
dim(sce)
## [1] 37657 15858
# Next we check those highly expressed genes 
par(mfrow=c(1,2),mar=c(5,8,1,1))

od1 = order(rowData(sce)$mean_counts, decreasing = TRUE)
barplot(rowData(sce)$mean_counts[od1[20:1]], las=1, 
    names.arg=rowData(sce)$gene[od1[20:1]], 
    horiz=TRUE, cex.names=1, cex.axis=0.7, 
    xlab="ave # of UMI")
barplot(log10(rowData(sce)$mean_counts[od1[20:1]]), las=1, 
    names.arg=rowData(sce)$gene[od1[20:1]], 
    horiz=TRUE, cex.names=1, cex.axis=0.7, 
    xlab="log10(ave # of UMI)")

saveRDS(sce,file.path(work_dir,"post_gene_filter.rds"))
gc()
##             used   (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells   5667496  302.7   10052351  536.9    8403104  448.8
## Vcells 350111536 2671.2 1199268152 9149.7 1250756922 9542.6
# To load image
# sce = readRDS(file.path(work_dir,"post_gene_filter.rds"))

3.3 Normalization

A simple solution for normalization and stablizing expression variance across genes is to tranform the count data by log(count/size.factor + 1). One may calcualte size.factor per cell as the total number of reads/UMIs, and this assumes the total expression are the same across all the cells. However, the total expression of each cell may vary with respect to cell type and/or cell size, and the computeSumFactors function in R package scran provides a more sophisicated way to calcualte size.factor to allow such variation across cells (Lun, Bach, and Marioni 2016). computeSumFactors can use initial clustering of cells to normalize expression within and beetween clusters. Within a cluster, it estimates the size factor for many groups of cells so that there are more groups than cells, and then it can calcualte the size factor per cell using a linear deconvolution system.

This method will fail (i.e., giving negative size factors) if there are too many zeros in the count data. Therefore it is necessesary to remove “genes with a library size-adjusted average count below a specified threshold” using the parameter min.mean. See here for more explanations.

min_mean = 1

date()
## [1] "Sun Feb 17 21:51:55 2019"
clusters = quickCluster(sce, min.mean=min_mean, method="igraph")
table(clusters)
## clusters
##    1    2    3    4    5    6    7    8    9 
## 1422  254  236 2335  791  351 3481 4356 2632
date()
## [1] "Sun Feb 17 21:55:52 2019"
sce = computeSumFactors(sce, cluster=clusters, min.mean=min_mean)
date()
## [1] "Sun Feb 17 22:02:55 2019"
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05659 0.81609 0.99811 1.00000 1.16893 7.05769

As shown in the following plot, the final size factor estimation is indeed highly correlated with the naive definition by total count.

par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(sce$total_counts, sizeFactors(sce), log="xy", 
    xlab="total counts", ylab="size factors", 
    cex=0.3, pch=20, col=rgb(0.1,0.2,0.7,0.3))

Finally, the command normalize(sce) adds the normalized expression into the variable sce.

sce = normalize(sce)

4 Dimension reduction

For dimension reduction, such as calculating PCA or performing TSNE, we should start by identifying a subset of genes with high level of biological signal relative to background (technical) noise. We start by examinng mean-variance relationship.

Since the information of individual spike-ins have been removed from this MTG dataset, we implemented the code below similar to here.

date()
## [1] "Sun Feb 17 22:03:13 2019"
new_trend = makeTechTrend(x=sce)
date()
## [1] "Sun Feb 17 22:09:51 2019"
fit = trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
date()
## [1] "Sun Feb 17 22:10:18 2019"
par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(fit$mean, fit$vars, pch=20, col=rgb(0.1,0.2,0.7,0.6), 
    xlab="log(mean)", ylab="var")
curve(new_trend(x), col="red", lwd=2, add=TRUE)
curve(fit$trend(x), col="orange", lwd=2, add=TRUE)
legend("topright", legend=c("Poisson noise", "observed trend"), 
    lty=1, lwd=2, col=c("red", "orange"), bty="n")

The above function makeTechTrend() assumes a Poisson model and from the above plot, it is clearly not a suitable fit. So we will keep fit equal to the loess fit from trendVar() rather than setting it equal to the output from makeTechTrend().

In the following code, we used the decomposeVar function from R/cran to estimate the technical/biological component of variance of each gene.

The technical component of the variance for each gene is determined by interpolating the fitted trend in fit at the mean log-count for that gene. This represents variance due to sequencing noise, variability in capture efficiency, etc. The biological component is determined by subtracting the technical component from the total variance. Highly variable genes (HVGs) can be identified as those with large biological components

dec     = decomposeVar(fit=fit)
top_dec = dec[order(dec$bio,decreasing=TRUE),]
top_dec[1:10,]
## DataFrame with 10 rows and 6 columns
##                        mean            total              bio
##                   <numeric>        <numeric>        <numeric>
## ENC1       6.09953103432473 17.6135525047028 8.22185678948232
## GAD1       2.61037642718004  17.482161519494 7.93040403516013
## SPARCL1    7.17861942532271 14.9308385190744 7.22008150620191
## SLC6A1     2.97727719686037  17.291562981475 7.11731234153986
## CHN1       7.91950610450711 13.7011554174148 6.95860626939825
## CXCL14     1.75210540159507 14.2949412930129 6.89414271521293
## ARPP21     6.91755936616036  14.833062933671 6.75377181556627
## CCK        5.70554628157203 16.4903035771684 6.54112112865473
## SYNPR      2.90912717097025 16.4483987038949 6.39464948644521
## KCNIP4-IT1 5.62880780753245 16.3288612093542 6.27955725014184
##                        tech   p.value       FDR
##                   <numeric> <numeric> <numeric>
## ENC1       9.39169571522048         0         0
## GAD1       9.55175748433385         0         0
## SPARCL1    7.71075701287247         0         0
## SLC6A1     10.1742506399352         0         0
## CHN1       6.74254914801657         0         0
## CXCL14     7.40079857779996         0         0
## ARPP21     8.07929111810478         0         0
## CCK        9.94918244851365         0         0
## SYNPR      10.0537492174497         0         0
## KCNIP4-IT1 10.0493039592124         0         0
par(mfrow=c(2,2))
smart_hist(dec$bio,breaks=30,xlab="Biological Variance")
smart_hist(dec$FDR,breaks=30,xlab="FDR")
smart_hist(log10(dec$FDR + 1e-6),breaks=30,xlab="log10(FDR + 1e-6)")
par(mfrow=c(1,1))

wtop = match(rownames(top_dec)[1:10], rownames(sce))
sce_sub = sce[wtop,]
dim(sce_sub)
## [1]    10 15858
plotExpression(sce_sub, features=1:10)

Here we only select approximately the top a few thousands highly variable genes (HVGs) based on tuning thresholds on dec$bio and dec$FDR from the earlier variance decomposition.

summary(dec$bio)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -10.012105  -0.073947  -0.005551  -0.007053   0.083276   8.221857
summary(dec$FDR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5732  1.0000  1.0000
dec1 = dec
dec1$bio[which(dec1$bio < 1e-4)]  = 1e-4
dec1$FDR[which(dec1$FDR < 1e-50)] = 1e-50

par(mfrow=c(2,2))
smart_hist(log10(dec1$bio), breaks=100, xlab="log10(bio)")
smart_hist(log10(dec1$FDR), breaks=100, main="", xlab="log10(FDR)")
plot(log10(dec1$bio), log10(dec1$FDR), xlab="log10(bio)", ylab="log10(FDR)", 
    pch=20, cex=0.5)
par(mfrow=c(1,1))

FDR_thres = 1e-20 
bio_thres = 0.1
FDR_keep = dec$FDR < FDR_thres
bio_keep = dec$bio > bio_thres
table(FDR_keep,bio_keep)
##         bio_keep
## FDR_keep FALSE  TRUE
##    FALSE 26064  4143
##    TRUE   2705  4745
w2kp = which(dec1$FDR < FDR_thres & dec1$bio > bio_thres)
summary(rowData(sce)$n_cells_by_counts[w2kp])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     145     842    2223    3645    5444   15722
sce_hvg = sce[w2kp,]
sce_hvg
## class: SingleCellExperiment 
## dim: 4745 15858 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(4745): A2M-AS1 AADAT ... ZXDA ZXDB
## rowData names(12): gene chromosome ... total_counts
##   log10_total_counts
## colnames(15858): F1S4_160106_001_B01 F1S4_160106_001_C01 ...
##   F2S4_170405_060_F01 F2S4_170405_060_H01
## colData names(44): sample_name sample_id ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(0):
## spikeNames(0):

Next we use the selected genes for PCA and use top 50 PCs for TSNE plot.

edat = t(as.matrix(logcounts(sce_hvg)))
edat = scale(edat)
dim(edat)
## [1] 15858  4745
edat[1:2,1:3]
##                        A2M-AS1      AADAT     AAED1
## F1S4_160106_001_B01 -0.1914392 -0.5988271 -0.202752
## F1S4_160106_001_C01 -0.1914392  1.2214955 -0.202752
ppk = propack.svd(edat,neig=50)
pca = t(ppk$d*t(ppk$u))
dim(pca)
## [1] 15858    50
df_hvg = smart_df(pca)
rownames(df_hvg) = NULL
names(df_hvg) = paste0("PC",seq(ncol(df_hvg)))
df_hvg        = smart_df(sample_name = colnames(sce_hvg), df_hvg)

all_vars = c("log10_total_features_by_counts", "sex", "brain_hemisphere",
    "brain_subregion", "facs_sort_criteria", "class", "cluster",
    "cell_type")
all_vars %in% names(colData(sce_hvg))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
df_hvg = cbind(df_hvg, colData(sce_hvg)[,all_vars])
dim(df_hvg)
## [1] 15858    59
df_hvg[1:2,c(1:3,51:ncol(df_hvg))]
## DataFrame with 2 rows and 12 columns
##           sample_name               PC1               PC2
##           <character>         <numeric>         <numeric>
## 1 F1S4_160106_001_B01  1.38570005225832 -17.0606948426093
## 2 F1S4_160106_001_C01 -20.5795338696423 -1.14700228587648
##                PC50 log10_total_features_by_counts         sex
##           <numeric>                      <numeric> <character>
## 1 -3.19594986256135               3.72065535655172           M
## 2 -2.07134103237301               3.91634865227546           M
##   brain_hemisphere brain_subregion facs_sort_criteria         class
##        <character>     <character>        <character>   <character>
## 1                L              L5      NeuN-positive     GABAergic
## 2                L              L5      NeuN-positive Glutamatergic
##               cluster   cell_type
##           <character> <character>
## 1 Inh L4-6 SST B3GAT2         Inh
## 2 Exc L5-6 RORB TTC12         Exc
set.seed(100)
date()
## [1] "Sun Feb 17 22:12:36 2019"
tsne = Rtsne(pca, pca = FALSE)
date()
## [1] "Sun Feb 17 22:15:05 2019"
df_tsne = smart_df(tsne$Y)
names(df_tsne) = paste0("HVG_TSNE",seq(ncol(tsne$Y)))
dim(df_tsne)
## [1] 15858     2
df_tsne[1:2,]
##    HVG_TSNE1 HVG_TSNE2
## 1  -2.909237 -39.09050
## 2 -14.032734  45.77537
df_hvg = smart_df(df_hvg, df_tsne)

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "log10_total_features_by_counts",TYPE = "cont")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "sex",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "brain_hemisphere",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "brain_subregion",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "facs_sort_criteria",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "class",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "cluster",TYPE = "cat")

ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2",COL = "cell_type",TYPE = "cat")

reducedDims(sce_hvg) = SimpleList(PCA=pca, TSNE=tsne$Y)
sce_hvg
## class: SingleCellExperiment 
## dim: 4745 15858 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(4745): A2M-AS1 AADAT ... ZXDA ZXDB
## rowData names(12): gene chromosome ... total_counts
##   log10_total_counts
## colnames(15858): F1S4_160106_001_B01 F1S4_160106_001_C01 ...
##   F2S4_170405_060_F01 F2S4_170405_060_H01
## colData names(44): sample_name sample_id ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
saveRDS(list(sce_hvg=sce_hvg, df_hvg=df_hvg), 
        file.path(work_dir, "post_redDim_HVG.rds"))

rm(edat)
gc()
##              used    (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6581384   351.5   10052352   536.9   10052352   536.9
## Vcells 1665071672 12703.5 2656405646 20266.8 2656399828 20266.8

5 Clustering

5.1 Kmeans

We first try clustering using kmeans.

# rds = readRDS("post_redDim_HVG.rds"); df_hvg = rds$df_hvg; sce_hvg = rds$sce_hvg; rm(rds)
set.seed(100)
all_num_clust = c(10:20)
df_hvg = df_hvg[,!grepl("^KM_",names(df_hvg))]

for(num_clust in all_num_clust){
    # num_clust = 15
    cat(paste0("KM with ",num_clust," clusters.\n"))
  
    kmeans_out = kmeans(reducedDim(sce_hvg,"PCA"),
                        centers = num_clust,
                        iter.max = 1e3,
                        nstart = 50,
                        algorithm = "MacQueen")
    print(kmeans_out[c("betweenss","tot.withinss")])

    km_label = paste0("KM_",num_clust)
    df_hvg[[km_label]] = as.factor(kmeans_out$cluster)

    print(table(df_hvg$cell_type, kmeans_out$cluster))
    print(ggplot_custom(DATA = df_hvg,X = "HVG_TSNE1",Y = "HVG_TSNE2", 
                        COL = km_label,TYPE = "cat"))
}
## KM with 10 clusters.
## $betweenss
## [1] 6057756
## 
## $tot.withinss
## [1] 5427852
## 
##          
##              1    2    3    4    5    6    7    8    9   10
##   Astro      0    0    0    0    0    1    0  287    0    0
##   Endo       0    0    6    0    0    3    0    0    0    0
##   Exc     3315    0    9    1 2829  502    0   12 2741 1064
##   Inh        2  830   14 1145    0  882 1276    1    1    0
##   Micro      0    0   58    0    0    4    0    1    0    0
##   Oligo      0    0  311    0    0    0    0    2    0    0
##   OPC        0    0  233    0    0    1    0    4    0    0
##   unknown   58    2   19    4   70   60   15   24   63    8

## KM with 11 clusters.
## $betweenss
## [1] 6302689
## 
## $tot.withinss
## [1] 5182919
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11
##   Astro    287    0    0    1    0    0    0    0    0    0    0
##   Endo       0    0    0    8    0    0    0    0    1    0    0
##   Exc       11 2587    0  228  757 2995    0 2276    9 1609    1
##   Inh        1    1  837  717    0    1 1283    0   12    1 1298
##   Micro      0    0    0   62    0    0    0    0    1    0    0
##   Oligo      2    0    0    1    0    0    0    0  310    0    0
##   OPC        4    0    0  233    0    0    0    0    1    0    0
##   unknown   23   64    3   55    4   43   15   65   14   32    5

## KM with 12 clusters.
## $betweenss
## [1] 6583190
## 
## $tot.withinss
## [1] 4902417
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12
##   Astro      0    0    0    1    0    0    0    0    0    0  287    0
##   Endo       0    0    1    7    0    1    0    0    0    0    0    0
##   Exc     2543 2155    8    2 1061  135 1742 2815    0    1   11    0
##   Inh        1    1   12    4    0 1098    1    0  848  903    1 1282
##   Micro      0    0    1   62    0    0    0    0    0    0    0    0
##   Oligo      0    0  310    1    0    0    0    0    0    0    2    0
##   OPC        0    0    1  234    0    0    0    0    0    0    3    0
##   unknown   54   62   14   13    8   39   14   74    3    5   22   15

## KM with 13 clusters.
## $betweenss
## [1] 6667206
## 
## $tot.withinss
## [1] 4818402
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0    0    0    1    0    0  287    0    0    0
##   Endo       0    0    0    0    0    0    8    0    1    0    0    0    0
##   Exc     1711 1674 2052    0    0  267  291    0    8   12 2014 2443    1
##   Inh        2    0    1  823 1269    0  913    1   12    1    0    0 1129
##   Micro      0    0    0    0    0    0   61    0    1    1    0    0    0
##   Oligo      0    0    0    0    0    0    1    0  310    2    0    0    0
##   OPC        0    0    0    0    0    0    0  234    1    3    0    0    0
##   unknown   15   25   42    1   16   24   54    2   14   23   39   63    5

## KM with 14 clusters.
## $betweenss
## [1] 6824459
## 
## $tot.withinss
## [1] 4661149
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0  287    0    0    0    0    0    0    0    1
##   Endo       0    0    0    0    0    0    0    0    1    0    0    0    8
##   Exc        0    1 1752 1580   12    0 1671    0    8 1750    0 2043    3
##   Inh      750  832    0    2    0 1246    1 1053   12    0  250    1    4
##   Micro      0    0    0    0    0    0    0    0    1    0    0    0   62
##   Oligo      0    0    0    0    2    0    0    0  310    0    0    0    1
##   OPC        0    0    0    0    3    0    0    0    1    0    0    0  234
##   unknown   10    4   26   14   22   16   94   13   14   28    0   37   17
##          
##             14
##   Astro      0
##   Endo       0
##   Exc     1653
##   Inh        0
##   Micro      0
##   Oligo      0
##   OPC        0
##   unknown   28

## KM with 15 clusters.
## $betweenss
## [1] 7112286
## 
## $tot.withinss
## [1] 4373322
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0    0    0    0    0    0    0    0    0    0
##   Endo       0    0    0    0    0    0    0    0    1    0    0    0    0
##   Exc        0    0 1741  327  725 1639 1315    1    8    0 1493 1581 1629
##   Inh      877 1051    0    0    0    0    1  832   12 1371    1    2    0
##   Micro      0    0    0    0    0    0    0    0    1    0    0    0    0
##   Oligo      0    0    0    0    0    0    0    0  310    0    0    0    0
##   OPC        0    0    0    0    0    0    0    0    1    0    0    0    0
##   unknown    3   11   35    7    4   28   28    4   14   24   87   14   25
##          
##             14   15
##   Astro    287    1
##   Endo       0    8
##   Exc       11    3
##   Inh        1    3
##   Micro      0   62
##   Oligo      2    1
##   OPC        3  234
##   unknown   22   17

## KM with 16 clusters.
## $betweenss
## [1] 7233561
## 
## $tot.withinss
## [1] 4252047
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    1  287    0    0    0    0    0    0    0    0    0    0
##   Endo       0    8    0    0    0    0    0    1    0    0    0    0    0
##   Exc        0    3   11    0 1516 1580    1    8  727 1306 1671  260    0
##   Inh      940    4    1 1059    0    2  832   12    0    1    0    0 1299
##   Micro      0   62    0    0    0    0    0    1    0    0    0    0    0
##   Oligo      0    1    2    0    0    0    0  310    0    0    0    0    0
##   OPC        0  234    3    0    0    0    0    1    0    0    0    0    0
##   unknown    8   16   22   12   25   14    4   14    4   30   28   22   18
##          
##             14   15   16
##   Astro      0    0    0
##   Endo       0    0    0
##   Exc     1429  326 1635
##   Inh        1    0    0
##   Micro      0    0    0
##   Oligo      0    0    0
##   OPC        0    0    0
##   unknown   75    7   24

## KM with 17 clusters.
## $betweenss
## [1] 7332195
## 
## $tot.withinss
## [1] 4153413
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0  287    0    0    0    0    1    0    0    0    0
##   Endo       0    0    1    0    0    0    0    0    8    0    0    0    0
##   Exc        0 1123    8   11  326  760 1332 1605    3    1 1324    0 1656
##   Inh     1140    0   12    1    0    1    1    1    3  832    0  508    0
##   Micro      0    0    1    0    0    0    0    0   62    0    0    0    0
##   Oligo      0    0  310    2    0    0    0    0    1    0    0    0    0
##   OPC        0    0    1    3    0    0    0    0  234    0    0    0    0
##   unknown   14   19   14   22    7   20   82   28   16    4   23   12   32
##          
##             14   15   16   17
##   Astro      0    0    0    0
##   Endo       0    0    0    0
##   Exc      728 1596    0    0
##   Inh        0    1  601 1050
##   Micro      0    0    0    0
##   Oligo      0    0    0    0
##   OPC        0    0    0    0
##   unknown    4   14    0   12

## KM with 18 clusters.
## $betweenss
## [1] 7517664
## 
## $tot.withinss
## [1] 3967944
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0    0    0    0    0    0    0    0    1    0
##   Endo       0    0    0    1    1    0    0    0    0    0    0    7    0
##   Exc        0 1591 1513    0    8 1307  728    0  260    2  326    0 1641
##   Inh      607    2    0 1039   11    1    0  918    0    4    0    1    0
##   Micro      0    0    0    0    0    0    0    0    0   62    0    0    0
##   Oligo      0    0    0    0  310    0    0    0    0    1    0    0    0
##   OPC        0    0    0    0    1    0    0    0    0    0    0  234    0
##   unknown    0   14   24   15   12   31    4   11   22    9    7    7   24
##          
##             14   15   16   17   18
##   Astro      0  287    0    0    0
##   Endo       0    0    0    0    0
##   Exc     1429   11    1    0 1656
##   Inh        1    1  833  733    0
##   Micro      0    1    0    0    0
##   Oligo      0    2    0    0    0
##   OPC        0    3    0    0    0
##   unknown   75   21    4   15   28

## KM with 19 clusters.
## $betweenss
## [1] 7539736
## 
## $tot.withinss
## [1] 3945871
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0    0    0    0    0  287    0    0    0    0
##   Endo       0    0    0    0    0    0    0    0    0    0    1    0    1
##   Exc     1614    1  326 1401    0  260  726  615   11 1325   12 1336    8
##   Inh        0  761    1    0  579    0    0    0    1    1  707    0   12
##   Micro      0    0    0    0    0    0    0    0    0    0    0    0    1
##   Oligo      0    0    0    0    0    0    0    0    2    0    0    0  310
##   OPC        0    0    0    0    0    0    0    0    3    0    0    0    1
##   unknown   28    3    7   23    0   22    4    7   22   29   26   66   14
##          
##             14   15   16   17   18   19
##   Astro      0    0    1    0    0    0
##   Endo       0    0    7    0    0    0
##   Exc     1510    0    2    0 1326    0
##   Inh        0  870    3  669    2  545
##   Micro      0    0   62    0    0    0
##   Oligo      0    0    1    0    0    0
##   OPC        0    0  234    0    0    0
##   unknown   25    7   10   13   13    4

## KM with 20 clusters.
## $betweenss
## [1] 7673430
## 
## $tot.withinss
## [1] 3812178
## 
##          
##              1    2    3    4    5    6    7    8    9   10   11   12   13
##   Astro      0    0    0    0  287    0    0    0    0    0    0    0    0
##   Endo       0    0    0    0    0    0    0    0    0    1    0    0    0
##   Exc      260    0  763    0   11 1214  730  326 1152    8    1 1041    0
##   Inh        0  757    1  871    1    1    0    0    0   11  818    1 1004
##   Micro      0    0    0    0    1    0    0    0    0    0    0    0    0
##   Oligo      0    0    0    0    2    0    0    0    0  310    0    0    0
##   OPC        0    0    0    0    3    0    0    0    0    1    0    0    0
##   unknown   22    1   21    7   21   70    4    7   20   12    4    9   11
##          
##             14   15   16   17   18   19   20
##   Astro      0    0    0    0    0    1    0
##   Endo       0    0    0    0    0    7    1
##   Exc        2 1426 1311 1036 1192    0    0
##   Inh        4    0    0    1    0    1  680
##   Micro     62    0    0    0    0    0    0
##   Oligo      1    0    0    0    0    0    0
##   OPC        0    0    0    0    0  234    0
##   unknown    9   21   23    9   25    6   21

5.2 SC3

Next we try to clustering cell using SC3. Code used here is based on this link. Since SC3 is computationally much more expensive, we only try three number of clusters, 5, 10, and 15.

Next we compare the clustering results from SC3 and cell types.

It looks SC3 with 5 clusters separate GABAergic, Glutamatergic neurons, and non-neurons, but clusering results with 10 or 15 clusters are noisy. Here we at a bit more details for the results with 5 clusters.

Finally we save the sce object, sce_hvg object, and the clustering results.

saveRDS(sce, file.path(work_dir, "final_sce.rds"))
saveRDS(sce_hvg, file.path(work_dir, "final_sce_hvg.rds"))
saveRDS(df_hvg,  file.path(work_dir, "final_hvg_clust.rds"))

6 Session information

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)
## 
## Matrix products: default
## BLAS: /nas/longleaf/home/pllittle/downloads/R-3.5.1/lib64/R/lib/libRblas.so
## LAPACK: /nas/longleaf/home/pllittle/downloads/R-3.5.1/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SC3_1.10.1                  Rtsne_0.15                 
##  [3] svd_0.4.1                   data.table_1.12.0          
##  [5] limma_3.38.3                scran_1.10.2               
##  [7] scater_1.10.1               ggplot2_3.1.0              
##  [9] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
## [11] DelayedArray_0.8.0          BiocParallel_1.16.5        
## [13] matrixStats_0.54.0          Biobase_2.42.0             
## [15] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [17] IRanges_2.16.0              S4Vectors_0.20.1           
## [19] BiocGenerics_0.28.0         BiocInstaller_1.32.1       
## [21] rmarkdown_1.11              markdown_0.9               
## [23] knitr_1.21                 
## 
## loaded via a namespace (and not attached):
##  [1] ggbeeswarm_0.6.0         colorspace_1.4-0        
##  [3] class_7.3-15             dynamicTreeCut_1.63-1   
##  [5] XVector_0.22.0           BiocNeighbors_1.0.0     
##  [7] mvtnorm_1.0-8            codetools_0.2-16        
##  [9] doParallel_1.0.14        robustbase_0.93-3       
## [11] cluster_2.0.7-1          pheatmap_1.0.12         
## [13] shiny_1.2.0              HDF5Array_1.10.1        
## [15] rrcov_1.4-7              compiler_3.5.1          
## [17] assertthat_0.2.0         Matrix_1.2-15           
## [19] lazyeval_0.2.1           later_0.7.5             
## [21] htmltools_0.3.6          tools_3.5.1             
## [23] bindrcpp_0.2.2           igraph_1.2.3            
## [25] gtable_0.2.0             glue_1.3.0              
## [27] GenomeInfoDbData_1.2.0   reshape2_1.4.3          
## [29] dplyr_0.7.8              doRNG_1.7.1             
## [31] Rcpp_1.0.0               gdata_2.18.0            
## [33] iterators_1.0.10         DelayedMatrixStats_1.4.0
## [35] xfun_0.4                 stringr_1.3.1           
## [37] mime_0.6                 irlba_2.3.3             
## [39] rngtools_1.3.1           gtools_3.8.1            
## [41] WriteXLS_4.0.0           statmod_1.4.30          
## [43] edgeR_3.24.3             DEoptimR_1.0-8          
## [45] zlibbioc_1.28.0          scales_1.0.0            
## [47] promises_1.0.1           rhdf5_2.26.2            
## [49] RColorBrewer_1.1-2       yaml_2.2.0              
## [51] gridExtra_2.3            pkgmaker_0.27           
## [53] stringi_1.2.4            pcaPP_1.9-73            
## [55] foreach_1.4.4            e1071_1.7-0.1           
## [57] caTools_1.17.1.1         bibtex_0.4.2            
## [59] rlang_0.3.1              pkgconfig_2.0.2         
## [61] bitops_1.0-6             evaluate_0.13           
## [63] lattice_0.20-38          ROCR_1.0-7              
## [65] purrr_0.3.0              Rhdf5lib_1.4.2          
## [67] bindr_0.1.1              labeling_0.3            
## [69] cowplot_0.9.4            tidyselect_0.2.5        
## [71] plyr_1.8.4               magrittr_1.5            
## [73] R6_2.3.0                 gplots_3.0.1.1          
## [75] pillar_1.3.1             withr_2.1.2             
## [77] RCurl_1.95-4.11          tibble_2.0.1            
## [79] crayon_1.3.4             KernSmooth_2.23-15      
## [81] viridis_0.5.1            locfit_1.5-9.1          
## [83] grid_3.5.1               digest_0.6.18           
## [85] xtable_1.8-3             httpuv_1.4.5.1          
## [87] munsell_0.5.0            beeswarm_0.2.3          
## [89] registry_0.5             viridisLite_0.3.0       
## [91] vipor_0.4.5

Reference

Lun, Aaron TL, Karsten Bach, and John C Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell Rna Sequencing Data with Many Zero Counts.” Genome Biology 17 (1). BioMed Central: 75.